Conversation

@ayushag-nv ayushag-nv (Contributor) commented Nov 6, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for multimodal decode worker configuration, enabling disaggregated multimodal serving with independent decode components.
    • Introduced new launch script for orchestrated deployment of multimodal models across separate frontend, processor, encoding, prefill, and decode workers.
  • Chores

    • Removed legacy multimodal deployment script.

Signed-off-by: ayushag <[email protected]>
@ayushag-nv ayushag-nv requested review from a team as code owners November 6, 2025 05:54
copy-pr-bot bot commented Nov 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ayushag-nv ayushag-nv marked this pull request as draft November 6, 2025 05:54
@github-actions github-actions bot added the chore label Nov 6, 2025
@rmccorm4 rmccorm4 added backend::vllm Relates to the vllm backend multimodal labels Nov 6, 2025
Signed-off-by: ayushag <[email protected]>
@krishung5 krishung5 (Contributor) left a comment

LGTM! Please remove all the debug logs and the old disagg_multimodal.sh file in the example/multimodal folder before merging.

Signed-off-by: ayushag <[email protected]>
@ayushag-nv ayushag-nv marked this pull request as ready for review November 7, 2025 22:00
coderabbitai bot (Contributor) commented Nov 7, 2025

Walkthrough

Added multimodal decode worker support to the vLLM configuration system, updated handler initialization logic to route between decode and prefill+worker handlers based on configuration, introduced a new disaggregated multimodal serving script, and removed an obsolete orchestration script.

Changes

Configuration & Arguments: components/src/dynamo/vllm/args.py

  • Added the multimodal_decode_worker flag to the Config class and introduced the --multimodal-decode-worker CLI argument.
  • Integrated the flag into multimodal aggregation and updated error messaging.
  • Added conditional logic to set the component to "decoder" and the endpoint to "generate" for the decode worker; adjusted prefill+worker routing to use the "backend" component.

Worker Initialization & Handler Routing: components/src/dynamo/vllm/main.py

  • Expanded multimodal worker initialization to select between MultimodalDecodeWorkerHandler and MultimodalPDWorkerHandler based on config.multimodal_decode_worker.
  • Added decode worker client creation and connection for disaggregated mode.
  • Updated the MultimodalPDWorkerHandler signature to accept decode_worker_client instead of downstream_client; extended the multimodal component initialization conditions to include the decode worker flag.

Orchestration Scripts: examples/backends/vllm/launch/disagg_multimodal.sh

  • New Bash script for disaggregated multimodal serving: parses --model and --prompt-template CLI options, selects template defaults for llava-1.5-7b-hf, Phi-3.5-vision-instruct, and Qwen2.5-VL-7B-Instruct, launches the frontend, processor, and three workers (encode, prefill, decode) on distinct GPUs, and applies model-specific GPU memory tuning.

Legacy Scripts: examples/multimodal/launch/disagg.sh

  • Removed the orchestration script that launched the multimodal workflow with Ingress, processor, and worker services.
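The handler-selection change in main.py summarized above can be sketched as follows. This is an illustrative stand-in, not the actual dynamo.vllm code: the Config class here is a minimal mock, and the real initialization constructs handler objects rather than returning names.

```python
# Illustrative sketch of the multimodal handler routing summarized above.
# Config is a minimal stand-in for the real dynamo.vllm Config class; the
# returned strings name the handlers main.py selects.
from dataclasses import dataclass


@dataclass
class Config:
    multimodal_decode_worker: bool = False
    is_prefill_worker: bool = False


def select_multimodal_handler(config: Config) -> str:
    if config.multimodal_decode_worker:
        # Decode worker: standalone handler, no downstream client needed.
        return "MultimodalDecodeWorkerHandler"
    # Prefill (or combined) worker: in disaggregated mode this handler is
    # constructed with a decode_worker_client instead of downstream_client.
    return "MultimodalPDWorkerHandler"
```

Per the walkthrough, the decode-worker branch is the new path; all other multimodal worker configurations continue through MultimodalPDWorkerHandler.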

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Key areas requiring attention:

  • Verify handler selection logic correctly routes between MultimodalDecodeWorkerHandler and MultimodalPDWorkerHandler based on configuration
  • Review decode worker client initialization and connection flow in disaggregated mode, especially component and endpoint routing changes
  • Validate CLI argument parsing and Config propagation for multimodal_decode_worker flag
  • Confirm GPU memory tuning parameters and process launch ordering in the new disaggregated multimodal script are correct for target models
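As a concrete reference for reviewing the new script, its per-model default template selection might look like the sketch below. This is hypothetical: the template strings are placeholders, and the real values live in disagg_multimodal.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the model -> default prompt-template selection that
# disagg_multimodal.sh performs for its --prompt-template option. The
# template strings below are placeholders, not the script's actual values.
default_template() {
    case "$1" in
        *llava-1.5-7b-hf*)         echo "<llava default template>" ;;
        *Phi-3.5-vision-instruct*) echo "<phi default template>" ;;
        *Qwen2.5-VL-7B-Instruct*)  echo "<qwen default template>" ;;
        *)
            echo "no default template for model: $1" >&2
            return 1
            ;;
    esac
}
```

Matching on substrings of the model ID (rather than exact names) lets the same case arm cover both bare model names and org-prefixed paths like llava-hf/llava-1.5-7b-hf.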

Poem

🐰 A decode worker hops into view,
New flags and handlers, so fresh and new!
Disaggregated dreams split across GPUs bright,
While old scripts retire—farewell to the night!
Multimodal magic, now routing with care,
Through prefill and decode, everywhere! 🌟

Pre-merge checks

❌ Failed checks (2 warnings, 1 inconclusive)
  • Description check (⚠️ Warning): The PR description is entirely empty template placeholders with no actual content, making it impossible to understand the purpose or scope of the changes. Resolution: fill in all sections with concrete details, describe the multimodal disaggregated serving changes, specify files to review, and link the related GitHub issue.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check (❓ Inconclusive): The title 'chore: mm epd disagg' is vague and uses unclear abbreviations that don't convey what changed. Resolution: expand the abbreviations and be more specific, e.g. 'Add disaggregated multimodal serving configuration with decode worker support'.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
examples/backends/vllm/launch/disagg_multimodal.sh (2)

73-104: Consider adding health checks between component launches.

The script launches all components in rapid succession without waiting for dependencies to be ready. For example, the encode worker (line 89) may start before the processor (line 79) is ready to receive connections.

Consider one of these approaches:

Option 1: Add sleep delays between launches

 # Start processor
 echo "Starting processor..."
 python -m dynamo.vllm --multimodal-processor --model $MODEL_NAME --mm-prompt-template "$PROMPT_TEMPLATE" &
+sleep 5
 
 # Configure GPU memory optimization for specific models

Option 2: Add health check polling (more robust)

After each component launch, add a function to poll its health endpoint:

wait_for_service() {
    local port=$1
    local max_attempts=30
    for i in $(seq 1 $max_attempts); do
        if curl -s "http://localhost:$port/health" > /dev/null 2>&1; then
            return 0
        fi
        sleep 1
    done
    echo "Service on port $port failed to start"
    exit 1
}

87-97: Document or validate GPU availability.

The script assumes GPUs 1, 2, and 3 are available but doesn't validate this. Consider adding a GPU count check at the start or documenting the minimum GPU requirement.

Add GPU validation:

# After line 71, before starting components
GPU_COUNT=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
if [ $GPU_COUNT -lt 4 ]; then
    echo "Error: This script requires at least 4 GPUs, but only $GPU_COUNT found"
    exit 1
fi

Or document the requirement in the help text:

 echo "Disaggregated multimodal serving with separate Encode/Prefill/Decode workers"
 echo ""
+echo "Requirements: At least 4 NVIDIA GPUs"
+echo ""
 echo "Options:"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6336eea and ecda34b.

📒 Files selected for processing (4)
  • components/src/dynamo/vllm/args.py (5 hunks)
  • components/src/dynamo/vllm/main.py (3 hunks)
  • examples/backends/vllm/launch/disagg_multimodal.sh (1 hunks)
  • examples/multimodal/launch/disagg.sh (0 hunks)
💤 Files with no reviewable changes (1)
  • examples/multimodal/launch/disagg.sh
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-28T04:09:48.264Z
Learnt from: ayushag-nv
Repo: ai-dynamo/dynamo PR: 3634
File: components/src/dynamo/vllm/multimodal_handlers/processor_handler.py:66-72
Timestamp: 2025-10-28T04:09:48.264Z
Learning: In components/src/dynamo/vllm/multimodal_handlers/processor_handler.py, the AutoTokenizer.from_pretrained call with trust_remote_code=True is intentional and expected for the vLLM multimodal handler implementation.

Applied to files:

  • components/src/dynamo/vllm/main.py
📚 Learning: 2025-06-05T01:04:24.775Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: launch/dynamo-run/src/subprocess/vllm_v1_inc.py:71-71
Timestamp: 2025-06-05T01:04:24.775Z
Learning: The `create_endpoint` method in `WorkerMetricsPublisher` has backward compatibility maintained through pyo3 signature annotation `#[pyo3(signature = (component, dp_rank = None))]`, making the `dp_rank` parameter optional with a default value of `None`.

Applied to files:

  • components/src/dynamo/vllm/main.py
🧬 Code graph analysis (1)
components/src/dynamo/vllm/main.py (3)
components/src/dynamo/vllm/multimodal_handlers/worker_handler.py (2)
  • MultimodalDecodeWorkerHandler (25-81)
  • MultimodalPDWorkerHandler (84-260)
lib/bindings/python/src/dynamo/_core.pyi (5)
  • namespace (42-46)
  • component (88-92)
  • endpoint (117-121)
  • client (154-158)
  • wait_for_instances (193-200)
lib/bindings/python/rust/lib.rs (5)
  • namespace (491-496)
  • component (815-821)
  • endpoint (703-709)
  • client (785-799)
  • wait_for_instances (840-849)
🔇 Additional comments (9)
components/src/dynamo/vllm/args.py (3)

72-72: LGTM! Consistent flag implementation.

The multimodal_decode_worker flag is properly integrated across class attributes, CLI arguments, exclusivity checks, error messaging, and config propagation.

Also applies to: 174-178, 227-227, 232-232, 274-274


245-249: LGTM! Clear component routing for decode worker.

The decode worker correctly uses "decoder" as the component name to enable prefill worker connections in disaggregated mode.


250-253: LGTM! Critical routing for multimodal prefill worker.

The multimodal prefill worker correctly uses "backend" as the component name to maintain the encoder→backend connection, which differs from the standard prefill worker component naming. The comment clearly explains this design decision.
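The component naming called out in these two comments can be condensed into a small sketch. The function below is illustrative only (the real logic lives inline in args.py); the component-name strings come from the comments above.

```python
# Sketch of the component routing noted above: the decode worker registers
# as "decoder" (with endpoint "generate") so prefill workers can find it,
# while the multimodal prefill worker keeps the "backend" component name so
# the encoder -> backend connection is preserved. Illustrative only.
def multimodal_component_name(multimodal_decode_worker: bool,
                              is_prefill_worker: bool) -> str:
    if multimodal_decode_worker:
        return "decoder"
    if is_prefill_worker:
        # Unlike the standard prefill worker, multimodal prefill stays on
        # "backend" so the encode worker can still reach it.
        return "backend"
    return "backend"
```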

examples/backends/vllm/launch/disagg_multimodal.sh (3)

1-6: LGTM! Proper script initialization.

Good use of set -e for early exit on errors and trap for cleanup of background processes.


12-47: LGTM! Clear CLI interface.

The command-line argument parsing is well-structured with helpful usage information and examples.


49-63: LGTM! Sensible template defaults.

Model-specific prompt templates are properly defined with a clear fallback mechanism for unsupported models.

components/src/dynamo/vllm/main.py (3)

32-32: LGTM! Clean import addition.


109-113: LGTM! Correct routing for decode worker.

The condition appropriately routes multimodal_decode_worker to init_multimodal_worker, where handler selection occurs based on the specific worker type.


639-660: LGTM! Correct handler selection and client wiring.

The logic correctly distinguishes between:

  1. Decode worker (multimodal_decode_worker=True): Uses MultimodalDecodeWorkerHandler without needing a downstream client
  2. Prefill worker (is_prefill_worker=True): Creates decode_worker_client and passes it to MultimodalPDWorkerHandler for disaggregated mode

The handler signatures match the constructor definitions, and wait_for_instances() is properly called before use (line 649).

